I. EXECUTIVE SUMMARY



A. PROBLEM

At times even the most carefully designed and executed 
experiments can be plagued with aborted tests or missing 
data. Such imbalance in the data can have a significant
impact upon the mathematical structure of the analytic
techniques used in analysis of variance. In addition to
increasing the complexity of computations, an unbalanced design
can also seriously affect hypothesis testing. Because of
lack of balance, hypotheses purporting to test the influence
of a main effect, for example, may be hopelessly confounded
with interaction terms. Blindly "testing" such confounded
hypotheses without an appreciation of the level of pollution
from extraneous terms can lead to serious error in
interpreting results. It is desirable to find a general 
procedure for use with analysis of variance that can 
determine exactly what a proposed hypothesis is testing in 
terms of the main effects and interactions. 



B. APPROACH 



Because of its mathematical power and notational 
simplicity, the matrix form of the linear model Y = Xb + e 
is used in deriving a solution to the problem. The linear 
model leads to the "normal equations" X'Xb = X'Y. Since X'X
is in general not of full rank, any solution (b^0) for b is
not unique. Further, (X'X)^-1 does not exist; one must turn to
the concept of a generalized inverse G of X'X. It can be
shown that testing a hypothesis H: g'b = m involves
expressing the hypothesis as a linear function g'GX'X of the
generalized inverse (G) and X'X. While determination of
g'GX'X is frequently a non-trivial manual calculation, it
can be handled easily on a computer.

C. SOLUTION

If an analyst needs to test a particular hypothesis it 
is possible that additional, undesired terms may be 
polluting the hypothesis to such a degree that his 
interpretation of test results may be completely invalid. By 
computing the value of g'GX'X he will be able to determine 
precisely what his proposed hypothesis is actually testing. 

D. CONCLUSIONS

Recognizing that an unbalanced design can lead to
difficulty in interpreting traditional tests of hypotheses,
it is concluded that:

1. it is mathematically possible to determine the exact
nature of a proposed hypothesis, and 

2. such a determination is feasible using a computer. 
II. MATHEMATICAL JUSTIFICATION



A. GENERALIZED INVERSE 



A generalized inverse of a matrix A is defined to be any 
matrix G satisfying 

AGA = A. 

It can be shown that, for a given matrix A, G is in general 
not unique [Searle, 1971]. 
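
As a numerical illustration of the definition, the following minimal Python sketch (NumPy assumed) checks AGA = A for two different generalized inverses of one singular matrix. The matrix A and the perturbation W are hypothetical values chosen only for illustration, and the construction G1 + W - G1 A W A G1 is a standard identity for producing a second generalized inverse; it is not taken from the source.

```python
import numpy as np

# Hypothetical singular matrix (row 1 = row 2 + row 3, so rank 2).
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])

# The Moore-Penrose pseudoinverse is one generalized inverse of A.
G1 = np.linalg.pinv(A)

# For any W, G1 + W - G1 A W A G1 also satisfies A G A = A.
rng = np.random.default_rng(0)
W = rng.standard_normal(A.shape)
G2 = G1 + W - G1 @ A @ W @ A @ G1

is_ginv_1 = np.allclose(A @ G1 @ A, A)   # G1 satisfies the definition
is_ginv_2 = np.allclose(A @ G2 @ A, A)   # so does G2
differ = not np.allclose(G1, G2)         # yet G1 and G2 are different
```

Both matrices satisfy the defining equation although they are not equal, which is the non-uniqueness noted above.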

B. SOLUTION OF CONSISTENT LINEAR EQUATIONS



The system of linear equations AX = Y is consistent if
any linear relationships existing among the rows of A also
exist among the corresponding elements of Y. Since linear
equations have a solution if and only if they are
consistent, the procedures outlined below are confined to
systems of consistent linear equations.



The following theorems from Searle [1] are stated
without proof in order to develop solution procedures for
consistent equations.



Theorem 1. Consistent equations AX = Y have a solution 
X = GY if and only if AGA = A. 



Theorem 2. If A has q columns and if G is a generalized
inverse of A, then the consistent equations AX = Y have the
solution

X^0 = GY + (GA - I)Z

where Z is any arbitrary vector of order q. The superscript
notation indicates that X^0, which satisfies AX = Y, is a
solution and not the general vector of unknowns X.

Theorem 3. For the consistent equations AX = Y, all
solutions are, for any specific G, generated by
X^0 = GY + (GA - I)Z, for arbitrary Z. That is, one need
derive only one generalized inverse of A in order to be able
to develop all solutions to the system AX = Y.
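
The solution formula of Theorems 2 and 3 can be sketched numerically as follows. This is a minimal Python illustration (NumPy assumed); the matrix A, the right-hand side Y, and the vectors Z are hypothetical values, and Y is built as A times a known vector only to guarantee that the equations are consistent.

```python
import numpy as np

# Hypothetical singular system; Y is constructed to be consistent.
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
Y = A @ np.array([1.0, 2.0, 3.0])

G = np.linalg.pinv(A)        # any generalized inverse would serve
q = A.shape[1]
I = np.eye(q)

def solution(Z):
    """Return X^0 = GY + (GA - I)Z for an arbitrary vector Z of order q."""
    return G @ Y + (G @ A - I) @ Z

X0_a = solution(np.zeros(q))                  # Z = 0 gives X^0 = GY
X0_b = solution(np.array([5.0, -1.0, 2.0]))   # a different arbitrary Z
```

Each choice of Z yields a vector that satisfies AX = Y, so a single G is enough to generate the whole family of solutions.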



C. THE SPECIAL CASE OF SYMMETRIC MATRICES 

The linear model used, inter alia, in analysis of 
variance involves the system of consistent linear equations 

X'Xb = X'Y 

that are solved for b. It is therefore worthwhile to 
consider the special case of the symmetric matrix X'X in 
some detail. The following development is from Searle [1]. 

Lemma 1. X'X = 0 implies X = 0.

Proof: If X'X = 0, each diagonal element of X'X, which is the
sum of squares of the elements of a column of X, equals zero;
hence every element of X must be zero.

Lemma 2. PX'X = QX'X implies that PX' = QX'.

Proof: Apply Lemma 1 to the identity

(PX'X - QX'X)(P - Q)' = (PX' - QX')(PX' - QX')' = 0.

That is, (PX' - QX')(PX' - QX')' = 0 implies that PX' - QX' = 0,
which implies that PX' = QX'.
Theorem 4. When G is a generalized inverse of X'X, then

i. G' is also a generalized inverse of X'X;

ii. XGX'X = X (i.e., GX' is a generalized inverse of X);

iii. XGX' is invariant to G.

Proof:

(i) By definition

X'XGX'X = X'X;

transposing, X'XG'X'X = X'X, establishing (i).

(ii) From (i),      X'XG'X'X = X'X;
by Lemma 2,         X'XG'X' = X';
transposing,        XGX'X = X, establishing (ii).

(iii) Suppose F is some generalized inverse different
from G. Then XGX'X = X = XFX'X. By Lemma 2, XGX' = XFX'.
That is, XGX' is the same for all generalized inverses of
X'X, establishing (iii).
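
The three parts of Theorem 4 can be checked numerically. The Python sketch below (NumPy assumed) uses a hypothetical unbalanced one-way design matrix and a second generalized inverse F built from an arbitrary matrix W; neither the design nor the construction of F comes from the source.

```python
import numpy as np

# Hypothetical unbalanced one-way layout: columns are (mu, a1, a2);
# group 1 has two observations, group 2 has one.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
XtX = X.T @ X

G = np.linalg.pinv(XtX)                 # a symmetric generalized inverse
rng = np.random.default_rng(2)
W = rng.standard_normal(XtX.shape)
F = G + W - G @ XtX @ W @ XtX @ G       # another one, generally not symmetric

part_i = np.allclose(XtX @ F.T @ XtX, XtX)          # F' is a generalized inverse
part_ii = np.allclose(X @ F @ XtX, X)               # XFX'X = X
part_iii = np.allclose(X @ G @ X.T, X @ F @ X.T)    # XGX' = XFX'
```

Using a non-symmetric F makes part (i) a non-trivial check, since F' then differs from F.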



D. THE LINEAR MODEL



The general linear model is Y = Xb + e where Y is an 
n x 1 vector of observations whose components are random and 
observable; X is an n x p matrix of experimental design 
whose components are real and known; b is a p x 1 vector of 
parameters whose components are real and unknown; e is an 
n x 1 vector of experimental error whose components are 
random and unobservable. The vector e is defined as 

e = Y - E(Y)

E(e) = E(Y) - E(E(Y)) = 0,

and E(Y) = E(Xb) + E(e)

E(Y) = Xb.

Every element in e is assumed to have the same variance v^2
and zero covariance with every other element, thus e is
distributed (0, v^2 I) and Y is distributed (Xb, v^2 I). Deriving
the normal equations for the linear model yields

X'Xb = X'Y

which can be solved for b^0 using the techniques of
generalized inverses described earlier, i.e.,

b^0 = GX'Y

and E(b^0) = E(GX'Y)
           = GX'E(Y)
           = GX'Xb
           = Hb

where H = GX'X.
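
The dependence of the solution b^0 = GX'Y on the choice of G can be seen in a small numerical sketch. The Python code below (NumPy assumed) uses a hypothetical unbalanced one-way design and made-up observations; the second generalized inverse F is built from an arbitrary matrix W and is an assumption of this illustration, not part of the source.

```python
import numpy as np

# Hypothetical unbalanced one-way layout: columns are (mu, a1, a2).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
Y = np.array([3.0, 5.0, 10.0])          # made-up observations

XtX = X.T @ X
G = np.linalg.pinv(XtX)                 # one generalized inverse of X'X

# A second generalized inverse, built from an arbitrary matrix W.
rng = np.random.default_rng(1)
W = rng.standard_normal(XtX.shape)
F = G + W - G @ XtX @ W @ XtX @ G

b0_G = G @ X.T @ Y                      # solution obtained from G
b0_F = F @ X.T @ Y                      # solution obtained from F
H = G @ XtX                             # H = GX'X, so that E(b^0) = Hb

solves_G = np.allclose(XtX @ b0_G, X.T @ Y)
solves_F = np.allclose(XtX @ b0_F, X.T @ Y)
same_solution = np.allclose(b0_G, b0_F)
```

Both vectors satisfy the normal equations, yet they are not the same vector, which is why b^0 by itself cannot be read as an estimate of b.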



III. THE CONCEPT OF ESTIMABILITY 



A. ESTIMABILITY

As defined by Searle [1], a linear function g'b of the
parameters in b is estimable if it is equal to some linear
function [t'E(Y)] of the expected value of the observations
in Y. It is important to note that t' is not in general
unique; the only requirement for estimability is that such a
vector exist.



B. PROPERTIES 

The definition of estimability leads to four 
mathematical properties of immediate importance: 

(1) The expected value of any observation is estimable. 
In this case t' is a vector with a single element equal to 
one; the rest of its elements are zero. 

(2) Any linear combination of estimable functions is
estimable. If g'b and r'b are estimable, then g'b = t'E(Y)
and r'b = s'E(Y). Therefore c1 g'b + c2 r'b =
(c1 t' + c2 s')E(Y), which is estimable.

(3) An alternative form of the condition of estimability
can be developed as follows. If g'b is estimable, then by
definition g'b = t'E(Y), hence g'b = t'Xb. This must hold for
all values of b since the condition of estimability does not
depend on a specific choice of b. This leads to the result
g' = t'X.

(4) When g'b is estimable, g'b^0 is invariant to the
solution used for b^0 because

g'b^0 = t'Xb^0 = t'XGX'Y.

Since by Theorem 4, XGX' is invariant to G, g'b^0 is
invariant to G and therefore to b^0 when g'b is estimable.

Herein lies the essential importance of estimability: if
g'b is estimable, g'b^0 has the same value for all solutions
b^0. That is, an estimable function is a linear function of
the parameters that is invariant to whatever solution is
used for b^0.



C. THE TEST 



A function g'b is estimable if there exists some vector
t' such that g' = t'X. Finding such a vector t' may be a
formidable task with a design of large dimensions. As an
alternative, it is possible to test for estimability by
determining if g'H = g'. Searle [1] shows that g'b is
estimable if and only if g'H = g', as follows.

If g'b is estimable,

g' = t'X
g'H = t'XH
g'H = t'XGX'X;

by Theorem 4, GX' is a generalized inverse of X, hence

g'H = t'X
g'H = g'.

On the other hand, if

g' = g'H,
g' = g'GX'X

and g' = t'X for t' = g'GX'.
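
The test g'H = g' lends itself to direct computation. The following Python sketch (NumPy assumed) applies it to a hypothetical unbalanced one-way design with made-up observations; the helper name is_estimable, the tolerance, and the second generalized inverse F are assumptions of this illustration.

```python
import numpy as np

# Hypothetical unbalanced one-way layout: columns are (mu, a1, a2).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
Y = np.array([3.0, 5.0, 10.0])          # made-up observations

XtX = X.T @ X
G = np.linalg.pinv(XtX)
H = G @ XtX                             # H = GX'X

def is_estimable(g, H, tol=1e-9):
    """Return True when g'H = g' holds to within a numerical tolerance."""
    return bool(np.allclose(g @ H, g, atol=tol))

g_contrast = np.array([0.0, 1.0, -1.0])   # a1 - a2
g_single = np.array([0.0, 1.0, 0.0])      # a1 alone

contrast_ok = is_estimable(g_contrast, H)
single_ok = is_estimable(g_single, H)

# Two different solutions of the normal equations.
rng = np.random.default_rng(1)
W = rng.standard_normal(XtX.shape)
F = G + W - G @ XtX @ W @ XtX @ G
b0_G = G @ X.T @ Y
b0_F = F @ X.T @ Y
```

In this sketch the contrast a1 - a2 passes the test and g'b^0 takes the same value for both solutions, while a1 alone fails the test and its value changes with the solution used.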

D. THE CONSTRAINED MODEL



1. Development

The normal equations X'Xb = X'Y form a consistent
system of linear equations where X is of rank r < p. Because
X'X is, in general, not of full rank, there are many
solution vectors that will satisfy the system. In order to
obtain a particular solution b^0, additional constraints of
the form Cb = 0 are often added to the model. A commonly
used set of constraints satisfies the restrictions

* the main effects sum to zero

* the interaction effects sum to zero across each
subscript.

Adding the constraints Cb = 0, where the (p-r) rows
of C are linearly independent of the rows of X, yields the
following system of linear equations:



    | Y |     | X |       | e |
    |...|  =  |...| b  +  |...|
    | 0 |     | C |       | 0 |

The constraint matrix C can be used to transform the design
matrix X into a constrained matrix X* by performing basic
row operations on the system of linear equations.

    | Y |     | X* |       | e |
    |...|  =  |....| b  +  |...|
    | 0 |     | C  |       | 0 |

Note that X* is the same size (n x p) as X; b remains
unchanged. The practical effect of introducing the
constraints into the design matrix is to make some of the
columns of X* consist entirely of zeros. While b remains
unchanged, the transformation of X into X* has the effect of
"deleting" some elements of the parameter vector by the
mechanism of creating those columns of zeros. Once the
constraints have been integrated into the design matrix,
transforming X into X*, the constraints become redundant and
can be removed from the model by the following technique.
Let A be an (n, n+p-r) matrix such that

    A(i,j) = 0 if i ≠ j
    A(i,j) = 1 if i = j.

Then multiplying by A,

      | Y |       | X* |         | e |
    A |...|  =  A |....| b  +  A |...|
      | 0 |       | C  |         | 0 |

yields the constrained linear model Y = X*b + e, which is
equivalent to the constrained system above.



Since e is assumed normal (0, v^2 I), Y is also
normal; E(Y) = X*b. The normal equations, X*'X*b = X*'Y, can
be solved for a particular solution b^0 that will also
satisfy the original normal equations X'Xb = X'Y. If G* is
defined as a generalized inverse of X*'X*, then b^0 = G*X*'Y
and it follows that G*X*' is a generalized inverse of X*.
Let H* = G*X*'X*.

It is stressed that this constrained linear model
was developed solely for the purpose of finding some
particular solution vector b^0 to the original system of
normal equations. In the discussion that follows, X* is the
same size as X and the parameter vector b is the same in the
constrained linear model as it was in the original linear
model Y = Xb + e.

2. Example

As a simple example of the development of X*, assume
that the design matrix X is

    1  0  0  1
    1  0  1  0
    1  1  0  0

Let C be

    0  1  1  1 .

Then

    | X |
    |...|
    | C |

is

    1  0  0  1
    1  0  1  0
    1  1  0  0
    0  1  1  1

Subtracting the bottom row from the top row yields

    1 -1 -1  0
    1  0  1  0
    1  1  0  0
    0  1  1  1

which is

    | X* |
    |....|
    | C  |

The appropriate A matrix is

    1  0  0  0
    0  1  0  0
    0  0  1  0

Multiplying yields X*:

    1 -1 -1  0
    1  0  1  0
    1  1  0  0
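
The row operation and the selection by A in this example can be reproduced in a few lines. The Python sketch below (NumPy assumed) takes X as the three rows above the constraint and C as the single row 0 1 1 1, which is how the example is read here; the variable names are illustrative only.

```python
import numpy as np

# Design matrix X (n = 3, p = 4) and the single constraint row C.
X = np.array([[1.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])
C = np.array([[0.0, 1.0, 1.0, 1.0]])

# Stack X over C, then subtract the bottom row from the top row.
stacked = np.vstack([X, C])
stacked[0] -= stacked[-1]

# A is n x (n + p - r) with ones on the diagonal and zeros elsewhere,
# so multiplying by A keeps the first n rows and drops the constraint.
n = X.shape[0]
A = np.eye(n, stacked.shape[0])
X_star = A @ stacked
```

The last column of X_star consists entirely of zeros, which is the "deleting" effect on the parameter vector described in the development.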



E. BIOMEDICAL COMPUTER PROGRAMS 



The University of California publishes and maintains the 
BIOMED series of standard data analysis packages for use on 
digital computers [Dixon, 1576]. One of the programs within 
the package, BMD05V, performs computations for analysis of 
variance with the linear statistical model. The design 
matrix employed is not the same as the design matrix (X) in 
that model, however. A user of BMD05V is required to
introduce appropriate additional constraints to permit
computing a particular solution (b^0) for the parameter
vector. It will be shown that techniques applicable to the
design matrix X in the general linear model can be applied 
directly to the BMD05V design matrix. 
F. APPLICATION TO BMD05V



The constraints, Cb = 0, added to the linear model in
section D above are the type used to generate the BMD05V
design matrix. The resulting matrix X* has the same number
of columns as the original design matrix X, but because some
of these columns are zero, it is possible to suppress them
for arithmetic purposes. For computational simplicity, the
matrix actually used by the computer program deletes all the
zero columns and assumes a corresponding "reparameterized" b
vector of lower dimension. For mathematical rigor, however,
the X* used in the following sections retains the same
number of columns as X. This restriction will be eased when
the matrix X* is actually applied to the computer programs
in Appendix B.



G. ESTIMABILITY IN THE CONSTRAINED MODEL 



It can be shown that estimability in the constrained
model Y = X*b + e follows the same pattern as estimability
in the full model Y = Xb + e.



Theorem 5. g'b is estimable if and only if g'H* = g'.

Proof: By definition, g'b is estimable if

g'b = t'E(Y), i.e., if
g'b = t'X*b, i.e., if
g' = t'X*. Then
g'H* = t'X*G*X*'X*,
g'H* = t'X* = g'.

Conversely, if

g' = g'H*,
g' = g'G*X*'X*. Let t' = g'G*X*'. Then
g' = t'X*.



This result allows computations to be performed
directly upon the constrained matrix in order to examine the
estimability of proposed hypotheses. The computer program
HYTEST (Appendix A) can accept either the constrained design
matrix X* (with "zero" columns suppressed) or the standard
design matrix X as an input. If X is used for input, HYTEST
offers the option of using either the constrained matrix X*
or the standard matrix X to compute tests of estimability.

Note that g'H*b is always estimable since

g'H*b = g'G*X*'X*b = g'G*X*'E(Y) = t'E(Y), where t' = g'G*X*'.
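
That g'Hb is estimable even when g'b is not can be illustrated numerically. The Python sketch below (NumPy assumed) uses a hypothetical unconstrained one-way design matrix in place of X*, on the assumption that the same algebra applies to either matrix; the design, the vector g, and the use of the Moore-Penrose inverse for G are choices made only for this illustration.

```python
import numpy as np

# Hypothetical unbalanced one-way layout: columns are (mu, a1, a2).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
XtX = X.T @ X
G = np.linalg.pinv(XtX)
H = G @ XtX

g = np.array([0.0, 1.0, 0.0])    # a1 alone, which is not estimable
gH = g @ H                       # the function g'H that is estimable

# Weights t' = g'GX' from the text, for which t'X reproduces g'H.
t = g @ G @ X.T
```

Comparing gH with g shows which additional parameters enter the function once it is made estimable; in this sketch the result involves mu and a2 as well as a1, and it depends on the particular G chosen.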



H. TESTABILITY 



1. The Hypothesis

From Searle [1], all linear hypotheses can be
handled by a general procedure; specific hypotheses are then
considered to be applications of this general procedure.

The general hypothesis may be written H0: K'b = m,
where K' is a matrix of s rows and p columns. The only
limitation on K' is that it have full row rank. That is, the
hypothesis must be composed of linearly independent
functions of the parameter vector.



2. Analysis of Variance

To review analysis of variance briefly, classical
techniques rely upon the ratio of two independent Chi-square
variables, each divided by its respective degrees of
freedom, to generate an F statistic. The sum of squares
explained by the model if the hypothesis is assumed true,
divided by its degrees of freedom, forms the numerator of the
statistic. For many situations, the denominator is the sum
of squares for error divided by its degrees of freedom.
Each sum of squares can be conveniently represented by
an appropriate quadratic form, which must meet certain
requirements in order to be Chi-square distributed.



Searle's derivation of a test of the general
hypothesis depends upon K'b being estimable, i.e., upon
every row k_i'b being estimable. If this assumption is
satisfied, the quadratic form

Q/v^2 = (K'b^0 - m)'(K'GK)^-1 (K'b^0 - m)/v^2

is distributed non-central Chi-square and has rank s. The
sum of squares for error can be shown to be

SSE = (Y - XK(K'K)^-1 m)'(I - XGX')(Y - XK(K'K)^-1 m).

Q and SSE are independent, so F(H) = (Q/s)/(SSE/(n-r)) is
distributed non-central